Add ComputeDomain for running multi-node workloads #225

Merged
merged 72 commits into from
Feb 20, 2025

@klueska klueska commented Jan 9, 2025

No description provided.

@klueska klueska force-pushed the add-multi-node-crd branch 10 times, most recently from 38065b4 to 3e51cd8 Compare January 14, 2025 08:15
@klueska klueska force-pushed the add-multi-node-crd branch 20 times, most recently from 4ce9bdb to 0d435d8 Compare January 22, 2025 15:56
For now just mark one well-known error as permanent. Future commits will
abstract this better and mark more errors as permanent.

Signed-off-by: Kevin Klues <[email protected]>
@klueska klueska force-pushed the add-multi-node-crd branch 2 times, most recently from df90001 to 764c3c7 Compare February 19, 2025 15:43
// ComputeDomainSpec provides the spec for a ComputeDomain.
type ComputeDomainSpec struct {
NumNodes int `json:"numNodes"`
Channel *ComputeDomainChannelSpec `json:"channel"`
Collaborator

Should `channel` be optional, with future non-IMEX use cases in mind? I know we are currently focused solely on IMEX support, but if we want to carry the ComputeDomain concept forward, we may encounter clusters without IMEX (channels).

@@ -0,0 +1,89 @@
/*
* Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
Collaborator

Maybe do what some clothing brands do ("Since 1987"), but I am not a lawyer; maybe the year in the license header has a deeper legal meaning.

@@ -0,0 +1,49 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION. All rights reserved.
Collaborator

Is DRA really that old (2022)?

tail -f /dev/null & wait
fi
/usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
tail -n +1 -f /var/log/nvidia-imex.log & wait
Contributor

Discussed in a sync meeting a while ago: here we give up control of the IMEX daemon process (do we? how does errexit behave when a daemonized process exits non-zero?). In any case, for robustness and debuggability it will be good to actively monitor the health of the IMEX daemon process (polling the process, or better: getting a health signal actively and straight from the process). I'd like to look into that at some point, after merge.

Collaborator Author

This is still a problem. If the daemon crashes, we will not exit (though the liveness probe will eventually fail and the pod will be restarted). We should make this more robust as a follow-up, probably by not doing everything in bash and instead writing a small Go utility.

@klueska klueska merged commit d1fad7e into NVIDIA:main Feb 20, 2025
4 checks passed
4 participants